29 research outputs found

    Making Social Dynamics and Content Evolution Transparent in Collaboratively Written Text

    This dissertation presents models and algorithms for accurately and efficiently extracting data from revisioned content in Collaborative Writing Systems about (i) the provenance and history of specific sequences of text, as well as (ii) interactions between editors via the content changes they perform, especially disagreement. Visualization tools are presented that provide further insights into the extracted data. Collaboration mechanisms to be researched with these new data and tools are discussed.

    "I updated the <ref>": The evolution of references in the English Wikipedia and the implications for altmetrics

    With this work, we present a publicly available data set of the history of all the references (more than 55 million) ever used in the English Wikipedia until June 2019. We have applied a new method for identifying and monitoring references in Wikipedia, so that for each reference we can provide data about associated actions: creation, modifications, deletions, and reinsertions. The high accuracy of this method and the resulting data set was confirmed via a comprehensive crowdworker labeling campaign. We use the data set to study the temporal evolution of Wikipedia references as well as users’ editing behavior. We find evidence of a mostly productive and continuous effort to improve the quality of references: there is a persistent increase in reference and document identifiers (DOI, PubMedID, PMC, ISBN, ISSN, ArXiv ID), and most of the reference curation work is done by registered humans (not bots or anonymous editors). We conclude that the evolution of Wikipedia references, including the dynamics of the community processes that tend to them, should be leveraged in the design of relevance indexes for altmetrics, and our data set can be pivotal for such an effort.

    Characterizing the Global Crowd Workforce: A Cross-Country Comparison of Crowdworker Demographics

    Micro-task crowdsourcing is an international phenomenon that has emerged during the past decade. This paper sets out to explore the characteristics of the international crowd workforce and provides a cross-national comparison of the crowd workforce in ten countries. We provide an analysis and comparison of demographic characteristics and shed light on the significance of micro-task income for workers in different countries. This study is the first large-scale country-level analysis of the characteristics of workers on the platform Figure Eight (formerly CrowdFlower), one of the two platforms dominating the micro-task market. We find large differences between the crowd workforces of different countries, both in demography and in the importance of micro-task income for workers. Furthermore, we find that the composition of the workforce in the ten countries was largely stable across samples taken at different points in time.

    Wikiwhere: an interactive tool for studying the geographical provenance of Wikipedia references

    Wikipedia articles about the same topic in different language editions are built around different sources of information. For example, one can find very different news articles linked as references in the English Wikipedia article titled "Annexation of Crimea by the Russian Federation" than in its German counterpart (determined via Wikipedia's language links). Some of this difference can of course be attributed to the different language proficiencies of readers and editors in separate language editions. Yet although including English-language news sources seems to be no issue in the German edition, the English references listed there do not overlap highly with the ones in the article's English version. Such patterns could be an indicator of bias towards certain national contexts when referencing facts and statements in Wikipedia. However, determining for each reference which national context it can be traced back to, and comparing the link distributions to each other, is infeasible for casual readers or scientists with non-technical backgrounds. Wikiwhere answers the question of where Web references stem from by analyzing and visualizing the geographic location of external reference links that are included in a given Wikipedia article. Instead of relying solely on the IP location of a given URL, our machine learning models consider several features.

    Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

    Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms the current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts. In a large experiment over multilingual heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate estimates of populations and make a significant step towards representative social sensing in downstream applications with multilingual social media.

    The role of the information environment during the first COVID-19 wave in Germany

    The COVID-19 pandemic has been accompanied by intense debates about the role of the information environment. On the one hand, citizens learn from public information campaigns and news coverage and supposedly adjust their behaviours accordingly; on the other, there are fears of widespread misinformation and its detrimental effects. Analyzing the posts that the most important German information providers published via Facebook, this paper first identifies a uniform salience of subtopics related to COVID-19 across different types of information sources, which generally emphasized the threats to public health. Next, using a large survey conducted with German residents during the first COVID-19 wave in March 2020, we investigate how information exposure relates to perceptions, attitudes and behaviours concerning the pandemic. Regression analyses show that getting COVID-19-related information from a multitude of sources has a statistically significant and positive relationship with public health outcomes. These findings are consistent even across the ideological left/right spectrum and party preferences. These consistent correlational results demonstrate that during the first wave of COVID-19, a uniform information environment went hand in hand with a cautious public and widely accepted mitigation measures. Nonetheless, we discuss these findings against the backdrop of an increased politicization of public-health measures during later COVID-19 waves.